MIE 1517 Progress Report - Team 9

1. Introduction

This project aims to enhance workplace safety in industrial and construction environments through a real-time monitoring system. The system will leverage advanced computer vision techniques, such as YOLO, to ensure that workers are wearing personal protective equipment (PPE) and staying safe around hazardous tools and areas. The project has the following three phases planned:

  1. PPE Detection
  2. PPE Compliance Verification
  3. Hazard Zone Detection and Proximity Alert System

Workflow

1.1 System Overview

  1. The workflow begins with an image frame input, typically sourced from a camera. This frame is processed by YOLOv11, which performs two key tasks: object detection (identifying persons and PPE elements like helmets or vests) and pose estimation (identifying keypoints on the human body). The output includes the location and classification of detected objects, along with pose information.

  2. Next, the system generates a normalized depth map using MiDaS, a state-of-the-art depth estimation model. MiDaS uses an encoder-decoder architecture (e.g., a BEiT Encoder followed by a Decoder) to produce a relative depth map from a single RGB frame. However, this depth map is not in real-world units.

  3. To address this, a calibration step is performed. The system detects a known reference marker, such as an AprilTag, placed at a known distance from the camera. By comparing the AprilTag’s real-world depth to the MiDaS-generated normalized depth values, a real-depth conversion factor is calculated. Applying this factor transforms the normalized depth map into approximate real-world distances. Keypoints extracted from the YOLO pose output are then used to determine the depth of the target.

  4. Finally, the compliance check examines whether the detected person is wearing the required PPE and is outside of a defined “danger zone.” If someone is within a hazardous distance without proper PPE, the system triggers an alarm as a warning.
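The four steps above can be wired together as one per-frame routine. The sketch below is illustrative only: `detect_objects`, `estimate_depth`, `calibrate`, and `check_compliance` are hypothetical stand-ins for the YOLO and MiDaS components described in this report, passed in as callables.

```python
def process_frame(frame, detect_objects, estimate_depth, calibrate,
                  check_compliance):
    """Run the four pipeline stages on one camera frame.

    The four callables are hypothetical stand-ins for the report's
    YOLOv11 detection/pose, MiDaS depth, calibration, and compliance
    steps, respectively.
    """
    persons, ppe_items, keypoints = detect_objects(frame)  # step 1: YOLO
    rel_depth = estimate_depth(frame)                      # step 2: MiDaS
    real_depth = calibrate(rel_depth)                      # step 3: calibration
    # step 4: compliance + danger-zone check
    return check_compliance(persons, ppe_items, keypoints, real_depth)
```

The callable-based wiring keeps the heavy models swappable, which also makes the control flow easy to test in isolation.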

2. Data Collection

2.1 Loading Data

2.2 Brief Visualization

3. Models

3.1 YOLO11s

Training

Result

Model Summary

YOLO11s, with 319 layers and 9,429,727 parameters, is relatively lightweight. At 6.3 GFLOPs, it is computationally efficient and suitable for faster inference on limited hardware while maintaining high accuracy.

Overall Detection Performance


Class-specific Performance


Insights

  1. Training vs. Testing Performance:

    • The drop in performance from training to testing is minor, highlighting good generalization.
    • Classes like No Helmet and No Vest exhibit relatively larger performance gaps, indicating a need for further refinement in these categories.
  2. AUC Curves:

    • Most AUC curves for both training and testing datasets approach the top-right corner, reflecting strong overall performance.
    • The No Helmet class exhibits a lower AUC curve, showing room for improvement in detecting this category.
  3. Focus on Recall:

    • Recall is critical for this application to ensure that every safety violation is detected, minimizing risks.
  4. Real-world Examples:

    • The model performs well in real-world scenarios, handling complex cases like overlapping individuals and distinguishing PPE from similar non-PPE items (e.g., differentiating PPE helmets from ski helmets).

Conclusion

The YOLO11s model demonstrates strong performance in detecting PPE compliance, with high precision and recall across training and testing datasets. While it generalizes well, improvements are needed for No Helmet detection, which could be addressed with targeted data augmentation and refinements to the compliance-checking logic. Overall, the model is effective for real-world safety monitoring applications, ensuring critical safety violations are identified reliably.

[Figure: Training set PR curve and confusion matrix]
[Figure: Testing set PR curve and confusion matrix]

Some Examples

[Figure: detection examples 1 and 2]

3.2 PPE Compliance Check

The compliance check utilizes a has_overlap function, which checks whether a detected person's bounding box sufficiently overlaps with any of the bounding boxes for PPE items such as helmets or vests.
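A minimal sketch of such an overlap test, assuming boxes in (x1, y1, x2, y2) pixel coordinates; measuring overlap as the fraction of the PPE item's own box area inside the person's box, and the `min_ratio` threshold value, are assumptions here, and the report's actual has_overlap may differ:

```python
def has_overlap(person_box, item_box, min_ratio=0.5):
    """Return True if item_box (a PPE item) overlaps person_box by at
    least min_ratio of the item's own area. Boxes are (x1, y1, x2, y2).

    Note: min_ratio=0.5 is an illustrative default, not the project's
    actual threshold.
    """
    # Intersection rectangle of the two boxes
    ix1 = max(person_box[0], item_box[0])
    iy1 = max(person_box[1], item_box[1])
    ix2 = min(person_box[2], item_box[2])
    iy2 = min(person_box[3], item_box[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    item_area = (item_box[2] - item_box[0]) * (item_box[3] - item_box[1])
    return item_area > 0 and inter / item_area >= min_ratio
```

Normalizing by the item's area (rather than IoU) is the natural choice here, since a helmet box is tiny relative to a person box and IoU would always be small.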

3.3 MiDaS - Monocular Depth Estimation

With the PPE compliance check, the system can evaluate whether workers are correctly wearing PPE. Building on this, a pre-trained MiDaS model is used to estimate how far a worker is from hazardous areas; if a worker without PPE compliance gets too close, the system automatically triggers an alert, providing a timely warning.

Here is sample code for predicting the depth map of an image using pre-trained MiDaS.
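A sketch along the lines of the official intel-isl/MiDaS torch.hub usage (weights download on the first call); the `to_display` helper for normalizing the output for visualization is an addition of ours, not part of MiDaS:

```python
import numpy as np

def to_display(depth):
    """Normalize a raw MiDaS prediction to [0, 1] for visualization
    (helper added here; not part of MiDaS itself)."""
    depth = np.asarray(depth, dtype=np.float32)
    lo, hi = float(depth.min()), float(depth.max())
    if hi <= lo:
        return np.zeros_like(depth)
    return (depth - lo) / (hi - lo)

def predict_depth(img_rgb, model_type="DPT_Large"):
    """Predict a relative (inverse) depth map for one RGB frame
    (H, W, 3), following the intel-isl/MiDaS torch.hub usage.
    The output is relative depth, NOT in real-world units."""
    import torch  # imported lazily so to_display has no torch dependency

    midas = torch.hub.load("intel-isl/MiDaS", model_type)
    midas.eval()
    transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
    transform = (transforms.dpt_transform if "DPT" in model_type
                 else transforms.small_transform)
    with torch.no_grad():
        pred = midas(transform(img_rgb))
        # Resize the prediction back to the input resolution
        pred = torch.nn.functional.interpolate(
            pred.unsqueeze(1), size=img_rgb.shape[:2],
            mode="bicubic", align_corners=False).squeeze()
    return pred.cpu().numpy()
```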

The above results indicate that MiDaS is capable of understanding and predicting the spatial depth of objects and scenes from a single image input. In the visualizations, darker shades represent greater depth relative to the camera and lighter shades represent smaller depth.

Hence we utilized a pre-trained MiDaS model to generate relative, normalized depth maps and obtain the relative depth of a worker to the camera. The following steps summarize the process and results:

  1. Calibration with QR Codes:

    • Since MiDaS produces relative depth from objects in the image to the camera, we use calibration with QR codes to recover the absolute depth (real-world measurements).
    • Two QR codes with known real-world distances were used to calibrate the relative depth map into a real-world depth map by measuring their proportional relationship.
    • Result: The calibration process produced a mean absolute error (MAE) of 1.3 meters, which is acceptable for monocular depth estimation.
  2. Performance Comparison: The following table compares the MiDaS estimated depths to the ground truth values:

Name    MiDaS Estimated Depth (m)    Ground Truth Depth (m)
Tag 0   4.9                          3.4
Tag 1   3.6                          2.5
  3. Challenges in Depth Estimation:

    • Depth variability occurs in different body parts of a person, leading to inconsistent depth measurements.
  4. Enhancement with YOLO Pose Estimation:

    • We applied YOLO pose estimation to extract a few keypoints from the detected person (e.g., shoulders, knees).
    • The average depth of these keypoints was computed as a more reliable estimate of the overall depth of the person.
  5. Processing Time:

    • The process is computationally expensive, taking approximately 1–1.5 seconds per frame due to the additional steps of depth calibration and keypoint-based refinement.

This method improves the reliability of monocular depth estimation by incorporating pose-based refinement, although further optimization is needed to reduce processing time for real-time applications.

[Figures: depth map and depth-map pipeline diagram]

Sample Code for Calibration
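A minimal calibration sketch, assuming an affine mapping real = s·rel + t fitted from the two reference tags with known real-world depths; the report's actual "proportional norm" calibration may differ in detail:

```python
def fit_depth_calibration(rel_a, real_a, rel_b, real_b):
    """Fit real = s * rel + t from two reference markers (e.g., two
    AprilTags or QR codes) whose real-world depths are known.

    Assumes the two markers have distinct relative-depth values;
    the affine model itself is an assumption of this sketch.
    """
    s = (real_a - real_b) / (rel_a - rel_b)
    t = real_a - s * rel_a
    return s, t

def to_real_depth(rel, s, t):
    """Convert a relative MiDaS depth value into approximate meters."""
    return s * rel + t
```

With the factor fitted once per camera setup, every pixel of the normalized depth map can be converted to an approximate metric distance.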

Sample Code for Depth Estimation on Detected People
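A sketch of the keypoint-based refinement described above: sample the calibrated depth map at confident YOLO-pose keypoints and average them. The (x, y, confidence) keypoint format and the confidence threshold are assumptions of this sketch:

```python
import numpy as np

def person_depth(depth_map, keypoints, conf_thresh=0.5):
    """Average the calibrated depth at confident pose keypoints.

    depth_map: 2-D array of per-pixel depths in meters.
    keypoints: iterable of (x, y, confidence) tuples, e.g. from YOLO
               pose output (format assumed here).
    Returns NaN if no keypoint is usable.
    """
    h, w = depth_map.shape
    vals = []
    for x, y, c in keypoints:
        xi, yi = int(round(x)), int(round(y))
        # Skip low-confidence or out-of-frame keypoints
        if c >= conf_thresh and 0 <= xi < w and 0 <= yi < h:
            vals.append(depth_map[yi, xi])
    return float(np.mean(vals)) if vals else float("nan")
```

Averaging over several body keypoints smooths out the depth variability between body parts noted above.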

Monitor Worker's Proximity to Hazardous Area

With the ability to convert relative depth to real depth with acceptable error, we monitor a worker's proximity to a hazardous area by defining a real-world distance threshold: when a worker's depth falls below that threshold while the worker is not PPE-compliant, a timely warning is triggered.
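The rule reduces to a one-line check; the 3 m default below is an illustrative placeholder, not the project's actual threshold:

```python
def should_alert(depth_m, ppe_compliant, threshold_m=3.0):
    """Trigger a warning when a worker without PPE compliance is
    closer to the danger zone than the distance threshold.

    threshold_m=3.0 is an assumed placeholder value.
    """
    return depth_m < threshold_m and not ppe_compliant
```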

For the following functions:

The plot_bboxes function is responsible for visualizing detected workers within a frame. It overlays bounding boxes, compliance labels, and depth information, and displays a warning icon when non-compliant workers are too close to danger zones.

The overlay_warning_image function simply overlays a warning icon onto the frame at the location of a detected worker who is too close to a danger zone.

Results:

[Figures: depth-map results]

Reproducing the results

For detailed instructions on generating videos like the presentation demos, refer to the GitHub README.

4. Related Work

4.1 PPE Detection with YOLO

Several projects have utilized YOLO architectures for real-time PPE detection:

These projects demonstrate the effectiveness of YOLO models in accurately identifying PPE in real-time, contributing to improved safety compliance in various industries.

4.2 Depth Estimation with MiDaS

MiDaS has been instrumental in monocular depth estimation:

These past projects explored MiDaS's capability in providing depth estimation from single images. Collectively, these projects underscore the significant progress in PPE detection and depth estimation.

5. Discussion

5.1 Overall results

The YOLO11s model exhibits strong performance overall, particularly in its ability to balance computational efficiency and accuracy for PPE compliance monitoring. Achieving an AUC score of 0.965 during training and 0.842 on testing data highlights its capability to generalize effectively to unseen scenarios. The performance drop between training and testing datasets is relatively small, suggesting the model is well-trained and not overly fitted to the training data.

One unusual and interesting result is the consistent challenge in detecting the “No Helmet” category. Despite the high overall performance, this class lags behind others, with testing accuracy dropping to 0.71. This is likely due to imbalanced training data or visual ambiguities, such as occlusions or similarities with non-PPE objects like casual caps. Addressing this issue through targeted data augmentation or model refinement could significantly enhance its detection capabilities.

Another noteworthy insight is the model’s ability to handle complex real-world scenarios, such as distinguishing PPE items from visually similar objects and managing overlapping individuals. These findings underscore the model’s practicality for real-world safety monitoring, though further optimization is needed to improve computational efficiency for real-time applications.

While the depth estimation process using MiDaS and YOLO pose estimation provides additional context, one surprising result is the lack of accuracy in converting the relative depth map to a real depth map. Despite using MiDaS’s largest model and calibrating with two AprilTags of known depth, the mean absolute error (MAE) remains at 1.3 meters. This result highlights limitations in the calibration process and suggests the need for improved methods to enhance depth accuracy.

Computational optimization remains a key priority to reduce the 1–1.5 seconds per frame processing time.

5.2 What did we learn

Catastrophic Forgetting:
This occurs when a neural network forgets previously learned information. The trained YOLO model can only detect classes that appear in the training dataset. Our initial training data only included helmet and vest classes, and to enable the system to detect the different classes of interest so that PPE compliance can be checked, we came up with a few solutions:

The method we chose is to switch to a dataset with PPE and person classes, and to use a self-designed, rule-based algorithm to check PPE compliance.

Capabilities of YOLO11
Using the newest version of the YOLO family, YOLO11 introduces significant improvements in architecture and training methods, making it a versatile choice for a wide range of computer vision tasks such as

Note also that a model with a larger number of parameters usually comes with better performance, at the cost of slower inference.

MiDaS Strengths & Limitations:
One of the key learnings from this project was understanding the strengths and limitations of MiDaS for depth estimation.